Machine Learning Methods

Clustering and Predictive Modeling for Job Market Analysis

1 Introduction

This section applies machine learning techniques to uncover patterns in job market data, focusing on Business Analytics (BA), Data Science (DS), and Machine Learning (ML) roles. For job seekers entering these competitive fields in 2024, understanding the hidden structure of job postings, predicting salary ranges, and identifying role characteristics can inform career planning.

We employ three complementary machine learning approaches:

  1. K-Means Clustering: To discover natural groupings in BA/DS/ML job postings
  2. Regression Models: To predict salary ranges based on job characteristics
  3. Classification Models: To distinguish between different role types

Dataset loaded: 59,220 rows, 56 columns

2 Data Filtering for BA/DS/ML Analysis

To focus our analysis on relevant career paths for Business Analytics, Data Science, and Machine Learning professionals, we filter the dataset to include only positions matching these disciplines.
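The report does not show its filter rule, so the following is only a minimal sketch of one way such a filter could work. The keyword pattern and the helper name `filter_ba_ds_ml` are assumptions for illustration; only the `TITLE_NAME` column comes from the output shown in this section.

```python
import pandas as pd

# Hypothetical keyword pattern; the report's actual filter rule is not shown.
BA_DS_ML_PATTERN = r"data analy|business analy|data scien|machine learning"

def filter_ba_ds_ml(df: pd.DataFrame, title_col: str = "TITLE_NAME") -> pd.DataFrame:
    """Keep rows whose title matches any BA/DS/ML keyword (case-insensitive)."""
    mask = df[title_col].str.contains(BA_DS_ML_PATTERN, case=False, na=False, regex=True)
    return df[mask].copy()

# Tiny illustrative frame (made-up rows, not the real dataset).
jobs = pd.DataFrame({"TITLE_NAME": ["Data Analysts", "Registered Nurses",
                                    "SAP Business Analysts", None]})
filtered = filter_ba_ds_ml(jobs)
share = len(filtered) / len(jobs) * 100
print(f"Filtered to BA/DS/ML jobs: {len(filtered):,} postings")
print(f"Percentage of total dataset: {share:.2f}%")
```

Passing `na=False` treats missing titles as non-matches, so they are dropped rather than raising an error.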


Filtered to BA/DS/ML jobs: 15,378 postings
Percentage of total dataset: 25.97%

Top 10 Job Titles:
TITLE_NAME
Data Analysts                          6409
ERP Business Analysts                   369
Data Analytics Engineers                343
Data Analytics Interns                  328
Lead Data Analysts                      319
Data Analytics Analysts                 256
Master Data Analysts                    234
Business Intelligence Data Analysts     223
IT Data Analytics Analysts              221
SAP Business Analysts                   206
Name: count, dtype: int64

3 Feature Engineering

Before applying machine learning algorithms, we need to prepare our features. We focus on four quantitative measures of each posting: average salary, required years of experience, posting duration in days, and a remote-work indicator.
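As a rough sketch of how the four summarized features might be derived: the output column names (`AVG_SALARY`, `EXPERIENCE_YEARS`, `DURATION_DAYS`, `IS_REMOTE`) come from the summary in this section, but every raw input column used here (`SALARY_FROM`, `SALARY_TO`, `MIN_YEARS_EXPERIENCE`, `POSTED`, `EXPIRED`, `REMOTE_TYPE_NAME`) and the median imputation are assumptions, not the report's confirmed method.

```python
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive four numeric features from hypothetical raw posting columns."""
    out = pd.DataFrame(index=df.index)
    # Midpoint of the advertised salary band (assumed definition).
    out["AVG_SALARY"] = (df["SALARY_FROM"] + df["SALARY_TO"]) / 2
    # Fill missing experience with the median (assumed imputation choice).
    exp = df["MIN_YEARS_EXPERIENCE"]
    out["EXPERIENCE_YEARS"] = exp.fillna(exp.median())
    # Days between posting and expiry.
    posted = pd.to_datetime(df["POSTED"])
    expired = pd.to_datetime(df["EXPIRED"])
    out["DURATION_DAYS"] = (expired - posted).dt.days
    # 1 for remote postings, 0 otherwise.
    out["IS_REMOTE"] = df["REMOTE_TYPE_NAME"].eq("Remote").astype(int)
    return out

# Made-up rows for illustration only.
raw = pd.DataFrame({
    "SALARY_FROM": [80_000, 60_000],
    "SALARY_TO": [100_000, 70_000],
    "MIN_YEARS_EXPERIENCE": [3, None],
    "POSTED": ["2024-05-01", "2024-05-10"],
    "EXPIRED": ["2024-05-21", "2024-05-25"],
    "REMOTE_TYPE_NAME": ["Remote", "On-site"],
})
features = build_features(raw)
print(features)
```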

Feature Summary:
          AVG_SALARY  EXPERIENCE_YEARS  DURATION_DAYS     IS_REMOTE
count   15378.000000      15378.000000   15378.000000  15378.000000
mean    95154.236185          4.564053      20.494863      0.188256
std     24709.977582          2.199538      11.234962      0.390929
min     40000.000000          0.000000       0.000000      0.000000
25%     78307.982976          3.000000      15.000000      0.000000
50%     95015.535141          5.000000      18.000000      0.000000
75%    111787.084948          5.000000      23.000000      0.000000
max    193155.942661         15.000000      59.000000      1.000000

4 K-Means Clustering Analysis

Clustering helps us discover natural groupings in the job market. Different clusters might represent entry-level vs. senior positions, different specializations, or regional variations.

4.1 Elbow Method for Optimal K

Clustering dataset: 15,378 samples

Figure: Elbow Method for Determining Optimal Number of Clusters

Inertia values by K:
    K       Inertia
0   2  46083.334533
1   3  37098.092228
2   4  30324.433241
3   5  24516.626582
4   6  22230.434222
5   7  20302.311134
6   8  18622.328855
7   9  17146.137474
8  10  16041.649025
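An inertia table like the one above could be produced by a loop of the following shape. This is a sketch on synthetic data, and standardizing the features first is an assumption. The reported inertia at K=2 (46,083) is below 15,378 × 4 = 61,512, the total sum of squares of four standardized features, which is consistent with scaling, but the report does not show its preprocessing. The second part of the block computes the drop in inertia between consecutive K values from the reported numbers.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = ["AVG_SALARY", "EXPERIENCE_YEARS", "DURATION_DAYS", "IS_REMOTE"]

def inertia_by_k(features: pd.DataFrame, k_values, seed: int = 42) -> pd.DataFrame:
    """Fit K-Means for each K on standardized features and record inertia."""
    X = StandardScaler().fit_transform(features)
    rows = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        rows.append({"K": k, "Inertia": km.inertia_})
    return pd.DataFrame(rows)

# Synthetic stand-in data (not the real postings).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AVG_SALARY": rng.normal(95_000, 25_000, 300),
    "EXPERIENCE_YEARS": rng.integers(0, 16, 300),
    "DURATION_DAYS": rng.integers(0, 60, 300),
    "IS_REMOTE": rng.integers(0, 2, 300),
})
elbow = inertia_by_k(demo[FEATURES], range(2, 11))
print(elbow)

# Drop in inertia between consecutive K values, using the reported table.
reported = pd.Series(
    [46083.334533, 37098.092228, 30324.433241, 24516.626582, 22230.434222,
     20302.311134, 18622.328855, 17146.137474, 16041.649025],
    index=range(2, 11),
)
drops = -reported.diff().dropna()
print(drops.round(0))
```

On the reported values, each step up to K=5 removes roughly 5,800 to 9,000 units of inertia, while the step from K=5 to K=6 removes only about 2,300. Reading the elbow is a judgment call, and these numbers alone do not single out one K.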

4.2 Apply K-Means with Optimal K


Clustering complete with K=4

Cluster distribution:
Cluster
0    5057
1    5287
2    2282
3    2752
Name: count, dtype: int64

Cluster Characteristics:
        AVG_SALARY            EXPERIENCE_YEARS DURATION_DAYS IS_REMOTE
              mean     median             mean          mean      mean
Cluster                                                               
0        114309.83  111978.57             3.88         16.26      0.00
1         77114.70   78684.40             5.37         15.91      0.00
2         95016.68   94501.34             4.39         41.52      0.06
3         94725.13   95721.16             4.41         19.64      1.00
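The cluster assignment and the characteristics table above could be computed along these lines. This is a sketch on synthetic data: the helper name `cluster_and_profile`, the scaling step, and the random seed are assumptions, while the aggregation mirrors the columns shown in the table.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = ["AVG_SALARY", "EXPERIENCE_YEARS", "DURATION_DAYS", "IS_REMOTE"]

def cluster_and_profile(df: pd.DataFrame, k: int = 4, seed: int = 42):
    """Label each row with a K-Means cluster and summarize each cluster."""
    X = StandardScaler().fit_transform(df[FEATURES])
    labeled = df.copy()
    labeled["Cluster"] = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    profile = labeled.groupby("Cluster").agg({
        "AVG_SALARY": ["mean", "median"],
        "EXPERIENCE_YEARS": "mean",
        "DURATION_DAYS": "mean",
        "IS_REMOTE": "mean",
    }).round(2)
    return labeled, profile

# Synthetic stand-in data (not the real postings).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "AVG_SALARY": rng.normal(95_000, 25_000, 300),
    "EXPERIENCE_YEARS": rng.integers(0, 16, 300),
    "DURATION_DAYS": rng.integers(0, 60, 300),
    "IS_REMOTE": rng.integers(0, 2, 300),
})
labeled, profile = cluster_and_profile(demo)
print(labeled["Cluster"].value_counts().sort_index())
print(profile)
```

In the reported table, a mean `IS_REMOTE` of 1.00 for cluster 3 and 0.00 for clusters 0 and 1 indicates that the binary remote flag alone separates one cluster from the rest.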

4.3 PCA Visualization of Clusters

Figure: PCA Visualization of Job Clusters in BA/DS/ML Market

Variance explained:
PC1: 26.95%
PC2: 25.08%
Total: 52.03%
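The variance figures above correspond to the `explained_variance_ratio_` attribute of a fitted scikit-learn `PCA`. A minimal sketch on synthetic data, assuming the projection was computed on the same four standardized features used for clustering (the report does not confirm this):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four clustering features.
rng = np.random.default_rng(2)
raw = np.column_stack([
    rng.normal(95_000, 25_000, 300),   # salary-like column
    rng.integers(0, 16, 300),          # experience-like column
    rng.integers(0, 60, 300),          # duration-like column
    rng.integers(0, 2, 300),           # remote flag
])
X = StandardScaler().fit_transform(raw)

pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X)          # 2-D coordinates for plotting
ratios = pca.explained_variance_ratio_

print("Variance explained:")
print(f"PC1: {ratios[0] * 100:.2f}%")
print(f"PC2: {ratios[1] * 100:.2f}%")
print(f"Total: {ratios.sum() * 100:.2f}%")
```

Under that assumption, four perfectly uncorrelated standardized features would give each component exactly 25%. The reported 26.95% and 25.08% are close to that value, which suggests the features are only weakly correlated and that a 2-D plot hides about half of the variance.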

5 Regression Analysis: Salary Prediction

Understanding what factors drive salary differences can help job seekers negotiate better compensation and target high-paying opportunities.

5.1 Data Preparation for Regression

Regression dataset: 15,378 samples, 13 features
Salary range: $40,000 - $193,156
Median salary: $95,016

5.2 Multiple Linear Regression

MULTIPLE LINEAR REGRESSION RESULTS
==================================================
RMSE: $24,491.21
MAE: $19,614.15
R² Score: -0.0011

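R² is slightly below zero: the model explains none of the salary variance and performs marginally worse than always predicting the mean salary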
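A negative R² is possible because R² compares the model's squared error with that of a constant mean prediction: R² = 1 - SS_res / SS_tot. The toy salaries below are made up to illustrate the formula. The reported RMSE of $24,491 is also close to the salary standard deviation of $24,710 in the feature summary, which is what an R² near zero implies.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y: np.ndarray, y_hat: np.ndarray) -> float:
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up salaries for illustration.
y_true = np.array([70_000.0, 85_000.0, 95_000.0, 110_000.0, 130_000.0])

# Predicting the mean for every posting gives R^2 = 0 by construction.
mean_pred = np.full_like(y_true, y_true.mean())

# Adding noise unrelated to the target makes the error exceed SS_tot.
noisy_pred = mean_pred + np.array([5_000.0, -5_000.0, 5_000.0, -5_000.0, 5_000.0])

r2_mean = r2_manual(y_true, mean_pred)
r2_noisy = r2_manual(y_true, noisy_pred)
print(f"R² of mean prediction:  {r2_mean:.4f}")
print(f"R² of noisy prediction: {r2_noisy:.4f}")
```

The manual function agrees with `sklearn.metrics.r2_score`, so a value below zero cannot be read as a percentage of variance explained.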

5.3 Random Forest Regression

RANDOM FOREST REGRESSION RESULTS
==================================================
RMSE: $24,827.40
MAE: $19,926.56
R² Score: -0.0288

R² is below zero: the model explains none of the salary variance and performs worse than always predicting the mean salary
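One way to put such scores in context is to fit a constant-mean baseline next to the model. The sketch below uses synthetic data in which the target is deliberately independent of the features; the sample size, feature count, and hyperparameters are assumptions chosen for a quick demonstration, not the report's settings.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target is generated independently of the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = rng.normal(95_000, 25_000, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Baseline that always predicts the training-set mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_train, y_train)

r2_baseline = r2_score(y_test, baseline.predict(X_test))
r2_forest = r2_score(y_test, forest.predict(X_test))
print(f"Baseline R²:      {r2_baseline:.4f}")
print(f"Random Forest R²: {r2_forest:.4f}")
```

A constant prediction can never score above zero on the test set, so a model whose test R² sits at or below the baseline has not learned a usable relationship between the features and salary.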

5.4 Regression Model Comparison

Figure: Comparison of Salary Prediction Models

6 Classification: Role Type Prediction

Understanding the distinguishing characteristics of different role types can help job seekers tailor their applications and skill development.

6.1 Create Role Categories

Classification dataset: 14,143 samples

Role distribution:
ROLE_CATEGORY
Data Analytics        11944
Business Analytics     1776
Data Science            419
Machine Learning          4
Name: count, dtype: int64

Percentages:
ROLE_CATEGORY
Data Analytics        84.451672
Business Analytics    12.557449
Data Science           2.962596
Machine Learning       0.028283
Name: count, dtype: float64
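Role categories like these are typically assigned with keyword rules on the job title. The rules below are hypothetical (the report does not show its mapping); they illustrate an ordered first-match scheme in which titles matching no rule are left uncategorized, which would explain a classification dataset smaller than the filtered one.

```python
import pandas as pd

# Hypothetical ordered rules: the first matching category wins.
ROLE_RULES = [
    ("Machine Learning", ("machine learning",)),
    ("Data Science", ("data scien",)),
    ("Business Analytics", ("business analy", "business intelligence")),
    ("Data Analytics", ("data analy",)),
]

def categorize_role(title):
    """Return the first category whose keyword appears in the title, else None."""
    if not isinstance(title, str):
        return None
    lowered = title.lower()
    for category, keywords in ROLE_RULES:
        if any(keyword in lowered for keyword in keywords):
            return category
    return None

# Made-up titles for illustration.
titles = pd.Series(["Data Analysts", "ERP Business Analysts", "Data Scientists",
                    "Machine Learning Engineers", "Project Managers"])
roles = titles.map(categorize_role)

counts = roles.value_counts()
percentages = roles.value_counts(normalize=True) * 100
print(counts)
print(percentages.round(2))
```

Rule order matters for titles that match several keywords: under these rules, "Business Intelligence Data Analysts" would be labeled Business Analytics rather than Data Analytics.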

6.2 Prepare Classification Features

Classification features: 14
Samples per class:
ROLE_CATEGORY
Data Analytics        11944
Business Analytics     1776
Data Science            419
Machine Learning          4
Name: count, dtype: int64

6.3 Logistic Regression Classification

LOGISTIC REGRESSION CLASSIFICATION
==================================================
Accuracy: 0.8407 (84.07%)
F1 Score (Weighted): 0.7750

Classification Report:
                    precision    recall  f1-score   support

Business Analytics       0.23      0.02      0.03       533
    Data Analytics       0.85      0.99      0.91      3583
      Data Science       0.00      0.00      0.00       126
  Machine Learning       0.00      0.00      0.00         1

          accuracy                           0.84      4243
         macro avg       0.27      0.25      0.24      4243
      weighted avg       0.74      0.84      0.78      4243
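With classes this imbalanced, accuracy is easier to interpret next to a majority-class baseline. The short calculation below uses only the support counts from the report above.

```python
# Test-set support per class, copied from the classification report.
support = {
    "Business Analytics": 533,
    "Data Analytics": 3583,
    "Data Science": 126,
    "Machine Learning": 1,
}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a classifier that always predicts the most frequent class.
baseline_accuracy = support[majority_class] / total

logistic_accuracy = 0.8407  # reported above
print(f"Test samples: {total}")
print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Logistic Regression minus baseline: {logistic_accuracy - baseline_accuracy:+.4f}")
```

Always predicting Data Analytics would score about 84.4%, so the logistic regression's 84.07% is slightly below that trivial baseline.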

6.4 Random Forest Classification

RANDOM FOREST CLASSIFICATION
==================================================
Accuracy: 0.8562 (85.62%)
F1 Score (Weighted): 0.8151

Classification Report:
                    precision    recall  f1-score   support

Business Analytics       0.64      0.17      0.27       533
    Data Analytics       0.86      0.99      0.92      3583
      Data Science       0.89      0.06      0.12       126
  Machine Learning       0.00      0.00      0.00         1

          accuracy                           0.86      4243
         macro avg       0.60      0.31      0.33      4243
      weighted avg       0.84      0.86      0.82      4243
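The gap between the macro and weighted averages follows directly from the class sizes. The sketch below recomputes both F1 averages from the rounded per-class values in the Random Forest report, so the results match the printed figures only up to rounding.

```python
# Per-class F1 and support, copied from the Random Forest report above.
report = {
    "Business Analytics": {"f1": 0.27, "support": 533},
    "Data Analytics": {"f1": 0.92, "support": 3583},
    "Data Science": {"f1": 0.12, "support": 126},
    "Machine Learning": {"f1": 0.00, "support": 1},
}

total_support = sum(row["support"] for row in report.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total_support

print(f"Macro F1:    {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

The weighted score is dominated by Data Analytics, which accounts for 3,583 of 4,243 test samples, whereas the macro score exposes the weak results on the three smaller classes.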

6.5 Classification Model Comparison

Figure: Classification Model Performance Comparison

7 Key Insights for Job Seekers

Important: Machine Learning Insights for BA/DS/ML Career Planning

Clustering Insights:

  • K-Means clustering reveals 4 distinct segments in the BA/DS/ML job market
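  • The segments differ mainly in salary level (cluster means of about $114K vs. $77K), posting duration (about 42 days in one cluster vs. 16 to 20 days elsewhere), and remote status (one cluster is entirely remote); mean experience requirements vary less, from 3.88 to 5.37 years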
  • Understanding which cluster you target can help focus your job search

Salary Prediction Findings:

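  • Neither model predicts salary usefully: Linear Regression (R²=-0.0011) and Random Forest (R²=-0.0288) both score below zero, and Random Forest has the higher RMSE ($24,827 vs. $24,491)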
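  • With R² at or below zero for both models, the 13 features used here do not identify reliable salary predictors, so no ranking of experience, location, or skills is supported by these results
  • Remote positions do not stand out on salary in the clustering: the fully remote cluster has a mean salary of $94,725, close to the overall mean of $95,154
  • Feature importances from models with no predictive power should not be used to infer which skills and qualifications drive higher compensation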

Role Classification Results:

  • The Random Forest classifier reaches 85.6% accuracy across the four role categories (Logistic Regression: 84.1%), only slightly above the 84.4% obtained by always predicting Data Analytics
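  • Minority roles are rarely identified: Random Forest recall is 0.17 for Business Analytics and 0.06 for Data Science, and Machine Learning has only 4 postings in the classification dataset
  • Because the classes are so imbalanced, the macro-averaged F1 score (0.33 for Random Forest) reflects performance across roles better than accuracy does
  • These results do not yet establish distinct feature patterns for each role type, so they offer limited guidance for tailoring applications and skill development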